Sentiment analysis challenge on movie reviews

Designed by: Il faut sauver les datas, Ryan !

Boosz Paul 
Estrade Victor 
Gensollen Thibaut 
Rais Hadjer 
Sakly Sami 

Introduction

The dataset contains a balanced number of positive and negative movie reviews. It comes in the form of 3 different CSV files:

  • Train.csv contains 10k examples with 3 columns: id, label, review
  • Test_public.csv contains 5k examples with 3 columns: id, label, review
  • Test_private.csv contains 10k examples with only 2 columns: id, review

The goal is to predict the label column. The prediction quality is measured by the accuracy metric (the same metric is used in the cross-validation below).

Results should be a txt or csv file with 1 column: the predicted class in {0, 1}, as shown in this toolkit. You have to keep the original order of the datasets.
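
For instance, the first lines of a prediction file could look like the following (illustrative values, one predicted class per line, matching the fmt='%d' output of np.savetxt used at the end of this notebook):

1
0
0
1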

Fetch the data and load it in pandas

The first thing to do is to download all the data from the website: https://competitions.codalab.org/competitions/8131#learn_the_details-description

Then rename them (remove the prefix 'datasets_None_0b3a301a-be2e-4f21-8be9-dfa5c56439c4') to recover their original names:

  • train.data
  • valid.data
  • test.data
  • train_preprocessed.data
  • valid_preprocessed.data
  • test_preprocessed.data

and place them in a 'data/' folder.
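
A minimal sketch of the renaming step (assuming the downloaded files already sit in 'data/' and are named '<prefix>_train.data', '<prefix>_valid.data', and so on; adjust the separator if your filenames differ):

import glob
import os

prefix = 'datasets_None_0b3a301a-be2e-4f21-8be9-dfa5c56439c4'
for path in glob.glob(os.path.join('data', prefix + '*')):
    directory, filename = os.path.split(path)
    # Strip the download prefix (and any leftover underscore) from the name.
    new_name = filename.replace(prefix, '').lstrip('_')
    os.rename(path, os.path.join(directory, new_name))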


In [1]:
from __future__ import division, print_function
import pandas as pd
import numpy as np



In [2]:
data_dir = 'data/'

In [3]:
# Original training data: 10k labeled examples (id, label, review)
train = pd.read_csv(data_dir + 'train.data')
# Validation (public test) data: 5k examples
valid = pd.read_csv(data_dir + 'valid.data')
# Private test data, used for the final submission: 10k examples
test = pd.read_csv(data_dir + 'test.data')

In [4]:
print("train size", len(train))
print("public test size", len(valid))
print("private test size",len(test))


train size 10000
public test size 5000
private test size 10000

In [5]:
# creating arrays from pandas dataframe
X_train = train['review'].values
y_train = train['label'].values
X_valid = valid['review'].values
X_test = test['review'].values
print("raw text : \n", X_train[0])
print("label :", y_train[0])


raw text : 
 Yes, it's another great magical Muppet's movie and I adore them all; the characters, the movies, the TV show episodes (it's the best comedy or musical TV show ever) and all the artists behind it. But here they did such a rare fatal mistake and I'm surely talking about the weird ending !! <br /><br />I think it's very dangerous to involve that much, in American drama, and end a love affair by marriage !! We, as all the poor viewers, feel so free or maybe happy for the absence of its annoyance, peevishness and misery ! So we all enjoy these stories which gather 2 cute heroes as couple in love without the legitimate bond like Mickey Mouse and Minnie, Superman and Lois Lane, Dick Tracy and Tess, etc. So with all of the previous couples and their likes I bet that you feel safe, serenity and peace. Therefore when you look at what the makers of this movie had already done you'll be as mad as me !<br /><br />They made the weak miserable creature (Kermit) marry his daily nightmare, the most vexatious female ever (Miss Piggy) ! This is a historical change by the measures of the American entertainment's industry ! And it was pretty normal to have a negative impact upon the audience whom just refused to bless or believe or being satisfied with that sudden marriage (even the pathetic frog didn't have the time or the proper opportunity to think or to decide anything !). Therefore no wonder at all when you know that this movie is the most failure one in their cinematic serious, grossing only 25 millions vis-à-vis 65 millions earned by the first one (The Muppet Movie – 1979) five years earlier !!<br /><br />Simply in this movie they took Manhattan, and my rest too !
label : 1

In [6]:
print(len(X_test))


10000

Training and testing the model with cross-validation.


In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a random forest classifier with 100 trees
rfst = RandomForestClassifier(n_estimators=100)
# TF-IDF vectorizer: English stop words removed, vocabulary capped at 30000 terms
myTfidfVect = TfidfVectorizer(stop_words='english', max_features=30000)
X_train_transformed = myTfidfVect.fit_transform(X_train)
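
As a quick sanity check (not part of the original notebook), the result of fit_transform is a sparse matrix with one row per review and at most 30000 columns, one per retained term:

# Illustrative check: expected shape is (10000, 30000).
print(X_train_transformed.shape)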

The next cell may take some time.


In [8]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rfst, X_train_transformed, y_train,
                         scoring='accuracy', cv=5)
print('accuracy:', np.mean(scores), '+/-', np.std(scores))


accuracy: 0.8373 +/- 0.00580172388174
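
If the cross-validation above is too slow on your machine, cross_val_score also accepts an n_jobs parameter to spread the folds across CPU cores (a variant of the cell above, assuming a multi-core machine; the scores are unchanged):

scores = cross_val_score(rfst, X_train_transformed, y_train,
                         scoring='accuracy', cv=5, n_jobs=-1)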

Training the model on the complete training dataset.


In [9]:
rfst.fit(X_train_transformed, y_train)
print('Model trained.')


Model trained.

Get the predictions.


In [10]:
X_valid_transformed = myTfidfVect.transform(X_valid)
X_test_transformed = myTfidfVect.transform(X_test)

In [11]:
prediction_valid = rfst.predict(X_valid_transformed)
prediction_test = rfst.predict(X_test_transformed)

In [12]:
pd.DataFrame(prediction_valid[:5], columns=['prediction'])


Out[12]:
prediction
0 0
1 0
2 1
3 0
4 0

Save the results.


In [13]:
import os
# Create the 'results' directory if it does not exist yet.
if not os.path.isdir('results'):
    os.mkdir('results')
np.savetxt('results/valid.predict', prediction_valid, fmt='%d')
np.savetxt('results/test.predict', prediction_test, fmt='%d')

The last operation is to zip the results. Zip only the 'valid.predict' and 'test.predict' files, not the results directory!
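
A minimal sketch of this last step with the standard zipfile module (the archive name 'submission.zip' is our choice, not imposed by the challenge):

import zipfile

# Add the two prediction files at the root of the archive,
# not inside a 'results/' folder.
with zipfile.ZipFile('submission.zip', 'w') as zf:
    zf.write('results/valid.predict', arcname='valid.predict')
    zf.write('results/test.predict', arcname='test.predict')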